Abstract

We present an extensive empirical com- 
parison of several smoothing techniques in 
the domain of language modeling, includ- 
ing those described by Jelinek and Mer- 
cer (1980), Katz (1987), and Church and 
Gale (1991). We investigate for the first 
time how factors such as training data 
size, corpus (e.g., Brown versus Wall Street
Journal), and n-gram order (bigram versus 
trigram) affect the relative performance of 
these methods, which we measure through 
the cross-entropy of test data. In addition, 
we introduce two novel smoothing tech- 
niques, one a variation of Jelinek-Mercer 
smoothing and one a very simple linear in- 
terpolation technique, both of which out- 
perform existing methods. 



1 Introduction 

Smoothing is a technique essential in the construc- 
tion of n-gram language models, a staple in speech 
recognition (Bahl, Jelinek, and Mercer, 1983) as well 
as many other domains (Church, 1988; Brown et al.,
1990; Kernighan, Church, and Gale, 1990). A language
model is a probability distribution over strings
P(s) that attempts to reflect the frequency with 
which each string s occurs as a sentence in natu- 
ral text. Language models are used in speech recog- 
nition to resolve acoustically ambiguous utterances. 
For example, if we have that $P(\text{it takes two}) \gg
P(\text{it takes too})$, then we know ceteris paribus to
prefer the former transcription over the latter.

While smoothing is a central issue in language
modeling, the literature lacks a definitive comparison
between the many existing techniques. Previous studies
(Nadas, 1984; Katz, 1987; Church and Gale, 1991; MacKay
and Peto, 1995) only compare a small number of methods
(typically two) on a single corpus and using a single
training data size. As a result, it is currently difficult
for a researcher to intelligently choose between
smoothing schemes.

In this work, we carry out an extensive empirical
comparison of the most widely used smoothing techniques,
including those described by Jelinek and Mercer (1980),
Katz (1987), and Church and Gale (1991). We carry out
experiments over many training data sizes on varied
corpora using both bigram and trigram models. We
demonstrate that the relative performance of techniques
depends greatly on training data size and n-gram order.
For example, for bigram models produced from large
training sets Church-Gale smoothing has superior
performance, while Katz smoothing performs best on
bigram models produced from smaller data. For the
methods with parameters that can be tuned to improve
performance, we perform an automated search for optimal
values and show that sub-optimal parameter selection can
significantly decrease performance. To our knowledge,
this is the first smoothing work that systematically
investigates any of these issues.

In addition, we introduce two novel smoothing
techniques: the first belonging to the class of smoothing
models described by Jelinek and Mercer, the second a very
simple linear interpolation method. We show that these
methods, while relatively simple to implement, yield good
performance in bigram models and superior performance
in trigram models.

We take the performance of a method m to be its
cross-entropy on test data

$$\frac{1}{N_T} \sum_{i=1}^{l_T} -\log_2 P_m(t_i)$$



where $P_m(t_i)$ denotes the language model produced
with method m and where the test data T is composed of
sentences $(t_1, \ldots, t_{l_T})$ and contains a total of
$N_T$ words. The entropy is inversely related to the
average probability a model assigns to sentences in the
test data, and it is generally assumed that lower entropy
correlates with better performance in applications.

1.1 Smoothing n-gram Models

In n-gram language modeling, the probability of a
string P(s) is expressed as the product of the
probabilities of the words that compose the string, with
each word probability conditional on the identity of the
last n - 1 words, i.e., if $s = w_1 \cdots w_l$ we have

$$P(s) = \prod_{i=1}^{l} P(w_i \mid w_1^{i-1}) \approx \prod_{i=1}^{l} P(w_i \mid w_{i-n+1}^{i-1}) \qquad (1)$$

where $w_i^j$ denotes the words $w_i \cdots w_j$. Typically, n is
taken to be two or three, corresponding to a bigram
or trigram model, respectively.^1

Consider the case n = 2. To estimate the probabilities
$P(w_i \mid w_{i-1})$ in equation (1), one can acquire
a large corpus of text, which we refer to as training
data, and take

$$P_{ML}(w_i \mid w_{i-1}) = \frac{c(w_{i-1} w_i) / N_s}{c(w_{i-1}) / N_s} = \frac{c(w_{i-1} w_i)}{c(w_{i-1})}$$

where $c(\alpha)$ denotes the number of times the string
$\alpha$ occurs in the text and $N_s$ denotes the total number
of words. This is called the maximum likelihood
(ML) estimate for $P(w_i \mid w_{i-1})$.

While intuitive, the maximum likelihood estimate
is a poor one when the amount of training data is
small compared to the size of the model being built,
as is generally the case in language modeling. For
example, consider the situation where a pair of words,
or bigram, say burnish the, doesn't occur in the
training data. Then, we have $P_{ML}(\text{the} \mid \text{burnish}) = 0$,
which is clearly inaccurate as this probability should
be larger than zero. A zero bigram probability can
lead to errors in speech recognition, as it disallows
the bigram regardless of how informative the acoustic
signal is. The term smoothing describes techniques
for adjusting the maximum likelihood estimate
to hopefully produce more accurate probabilities.

As an example, one simple smoothing technique is
to pretend each bigram occurs once more than it
actually did (Lidstone, 1920; Johnson, 1932; Jeffreys,
1948), yielding

$$P_{+1}(w_i \mid w_{i-1}) = \frac{c(w_{i-1} w_i) + 1}{c(w_{i-1}) + |V|}$$

where V is the vocabulary, the set of all words being
considered. This has the desirable quality of
preventing zero bigram probabilities. However, this
scheme has the flaw of assigning the same probability
to, say, burnish the and burnish thou (assuming
neither occurred in the training data), even though
intuitively the former seems more likely because the
word the is much more common than thou.

To address this, another smoothing technique is to
interpolate the bigram model with a unigram model
$P_{ML}(w_i) = c(w_i)/N_s$, a model that reflects how often
each word occurs in the training data. For example,
we can take

$$P_{interp}(w_i \mid w_{i-1}) = \lambda P_{ML}(w_i \mid w_{i-1}) + (1 - \lambda) P_{ML}(w_i)$$

getting the behavior that bigrams involving common
words are assigned higher probabilities (Jelinek and
Mercer, 1980).
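To make the three bigram estimates discussed in this section concrete, the following is a minimal Python sketch (not the implementation used in this work) of the maximum likelihood, add-one, and unigram-interpolated estimates. The toy corpus, the function names, and the weight 0.7 are illustrative assumptions; sentence padding is omitted, and the history count is taken over bigram histories so that each conditional distribution normalizes.

```python
from collections import Counter

# Toy training data (an illustrative assumption, not data from this study).
sentences = [
    "we burnish silver".split(),
    "the cat saw the dog".split(),
    "thou saw the cat".split(),
]

unigrams, bigrams, histories = Counter(), Counter(), Counter()
for sent in sentences:
    unigrams.update(sent)
    for prev, word in zip(sent, sent[1:]):
        bigrams[(prev, word)] += 1
        histories[prev] += 1          # c(w_{i-1}), counted as a bigram history

vocab = set(unigrams)                 # V
n_words = sum(unigrams.values())      # N_s


def p_ml(word, prev):
    """Maximum likelihood estimate c(prev word) / c(prev)."""
    return bigrams[(prev, word)] / histories[prev] if histories[prev] else 0.0


def p_plus_one(word, prev):
    """Add-one estimate (c(prev word) + 1) / (c(prev) + |V|)."""
    return (bigrams[(prev, word)] + 1) / (histories[prev] + len(vocab))


def p_interp(word, prev, lam=0.7):
    """Bigram ML estimate interpolated with the unigram ML estimate."""
    return lam * p_ml(word, prev) + (1 - lam) * unigrams[word] / n_words


for w in ("the", "thou"):
    print(w, p_ml(w, "burnish"), p_plus_one(w, "burnish"), p_interp(w, "burnish"))
```

On this toy corpus neither burnish the nor burnish thou is observed, so the maximum likelihood estimate is zero for both, add-one assigns them the same probability, and only the interpolated estimate prefers the more frequent word the.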



2 Previous Work 

The simplest type of smoothing used in practice is
additive smoothing (Lidstone, 1920; Johnson, 1932;
Jeffreys, 1948), where we take

$$P_{add}(w_i \mid w_{i-n+1}^{i-1}) = \frac{c(w_{i-n+1}^{i}) + \delta}{c(w_{i-n+1}^{i-1}) + \delta |V|} \qquad (2)$$

and where Lidstone and Jeffreys advocate $\delta = 1$.
Gale and Church (1990; 1994) have argued that this
method generally performs poorly.
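As an illustration of additive smoothing with a general $\delta$, the sketch below adds $\delta$ to every n-gram count and $\delta |V|$ to the history count. The function name, the toy token sequence, and the choice $\delta = 0.5$ are assumptions made only for this example.

```python
from collections import Counter


def additive_prob(word, history, ngram_counts, history_counts, vocab_size, delta=1.0):
    """(c(history + word) + delta) / (c(history) + delta * |V|)."""
    numer = ngram_counts[history + (word,)] + delta
    denom = history_counts[history] + delta * vocab_size
    return numer / denom


# Toy bigram data (hypothetical).
tokens = "a b a c a b b a".split()
vocab = sorted(set(tokens))
ngram_counts = Counter(zip(tokens, tokens[1:]))        # keys are (prev, word)
history_counts = Counter((w,) for w in tokens[:-1])    # keys are (prev,)

dist = {
    w: additive_prob(w, ("a",), ngram_counts, history_counts, len(vocab), delta=0.5)
    for w in vocab
}
print(dist, sum(dist.values()))
```

Every word in the vocabulary receives a non-zero probability after any history, and the smoothed distribution still sums to one.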



1 To make the term $P(w_i \mid w_{i-n+1}^{i-1})$ meaningful for
i < n, one can pad the beginning of the string with
a distinguished token. In this work, we assume there are
n - 1 such distinguished tokens preceding each sentence.

The Good-Turing estimate (Good, 1953) is central to
many smoothing techniques. It is not used directly for
n-gram smoothing because, like additive smoothing, it
does not perform the interpolation of lower- and
higher-order models essential for good performance.
Good-Turing states that an n-gram that occurs r times
should be treated as if it had occurred r* times, where

$$r^* = (r + 1) \frac{n_{r+1}}{n_r}$$

and where $n_r$ is the number of n-grams that occur
exactly r times in the training data.
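The Good-Turing adjustment can be sketched in a few lines of Python. This is a minimal illustration on hypothetical counts, using the raw $n_r$ values; the function name and data are assumptions for the example.

```python
from collections import Counter


def good_turing_adjusted_counts(ngram_counts):
    """Map each observed count r to r* = (r + 1) * n_{r+1} / n_r."""
    n = Counter(ngram_counts.values())   # n[r] = number of n-grams seen exactly r times
    return {r: (r + 1) * n[r + 1] / n[r] for r in sorted(n)}


# Hypothetical bigram counts: n_1 = 3, n_2 = 2, n_3 = 1.
counts = {"a b": 1, "b c": 1, "c d": 1, "d e": 2, "e f": 2, "f g": 3}
adjusted = good_turing_adjusted_counts(counts)
print(adjusted)
```

With raw counts-of-counts the largest observed r is mapped to zero because no n-gram occurs r + 1 times, which is one reason the $n_r$ values are usually smoothed before the estimate is applied.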

Katz smoothing (1987) extends the intuitions of
Good-Turing by adding the interpolation of higher-order
models with lower-order models. It is perhaps the most
widely used smoothing technique in speech recognition.

Figure 1: $\lambda$ values for old and new bucketing schemes for Jelinek-Mercer smoothing; each point represents a
single bucket

Church and Gale (1991) describe a smoothing
method that combines the Good-Turing estimate
with bucketing, the technique of partitioning a set
of n-grams into disjoint groups, where each group
is characterized independently through a set of
parameters. As in Katz smoothing, models are defined
recursively in terms of lower-order models. Each n-gram
is assigned to one of several buckets based on its
frequency predicted from lower-order models. Each
bucket is treated as a separate distribution and
Good-Turing estimation is performed within each,
giving corrected counts that are normalized to yield
probabilities.

The other smoothing technique besides Katz
smoothing widely used in speech recognition is due
to Jelinek and Mercer (1980). They present a class
of smoothing models that involve linear interpolation,
e.g., Brown et al. (1992) take

$$P_{interp}(w_i \mid w_{i-n+1}^{i-1}) = \lambda_{w_{i-n+1}^{i-1}} P_{ML}(w_i \mid w_{i-n+1}^{i-1}) + (1 - \lambda_{w_{i-n+1}^{i-1}}) P_{interp}(w_i \mid w_{i-n+2}^{i-1}) \qquad (3)$$

That is, the maximum likelihood estimate is interpolated
with the smoothed lower-order distribution,
which is defined analogously. Training a distinct
$\lambda_{w_{i-n+1}^{i-1}}$ for each $w_{i-n+1}^{i-1}$ is not generally felicitous;
Bahl, Jelinek, and Mercer (1983) suggest partitioning
the $\lambda_{w_{i-n+1}^{i-1}}$ into buckets according to $c(w_{i-n+1}^{i-1})$,
where all $\lambda_{w_{i-n+1}^{i-1}}$ in the same bucket are constrained
to have the same value.

To yield meaningful results, the data used to estimate
the $\lambda_{w_{i-n+1}^{i-1}}$ need to be disjoint from the data
used to calculate $P_{ML}$.^2 In held-out interpolation,
one reserves a section of the training data for this
purpose. Alternatively, Jelinek and Mercer describe
a technique called deleted interpolation where different
parts of the training data rotate in training either
$P_{ML}$ or the $\lambda_{w_{i-n+1}^{i-1}}$; the results are then averaged.
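The need for disjoint data can be seen in a small experiment. The sketch below is an illustration only: it uses a single interpolation weight for a bigram model and picks it by a simple grid search over cross-entropy, which is an assumption made for brevity and not the estimation procedure of Jelinek and Mercer. The token sequences and function names are hypothetical.

```python
import math
from collections import Counter


def make_interpolated_bigram(train):
    """Return prob(word, prev, lam) built from ML unigram and bigram estimates."""
    uni = Counter(train)
    hist = Counter(train[:-1])
    bi = Counter(zip(train, train[1:]))
    total = len(train)

    def prob(word, prev, lam):
        p_bigram = bi[(prev, word)] / hist[prev] if hist[prev] else 0.0
        return lam * p_bigram + (1 - lam) * uni[word] / total

    return prob


def cross_entropy(prob, data, lam):
    """Average negative log2 probability per predicted word (inf if any is zero)."""
    pairs = list(zip(data, data[1:]))
    probs = [prob(word, prev, lam) for prev, word in pairs]
    if min(probs) == 0.0:
        return math.inf
    return -sum(math.log2(p) for p in probs) / len(pairs)


def best_lambda(prob, data, grid):
    """Grid value of lambda with the lowest cross-entropy on the given data."""
    return min(grid, key=lambda lam: cross_entropy(prob, data, lam))


train = "a b a b a c a b".split()
held_out = "a b c a b b a".split()
grid = [i / 10 for i in range(11)]

prob = make_interpolated_bigram(train)
lam_train = best_lambda(prob, train, grid)        # tuned on the training data itself
lam_held_out = best_lambda(prob, held_out, grid)  # tuned on disjoint data
print(lam_train, lam_held_out)
```

Tuned on the training data, the search pushes the weight on the maximum likelihood estimate to one, so no smoothing takes place; tuned on the held-out sequence, which contains bigrams unseen in training, it settles on an intermediate value.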

Several smoothing techniques are motivated
within a Bayesian framework, including work by
Nadas (1984) and MacKay and Peto (1995).



3 Novel Smoothing Techniques 

Of the great many novel methods that we have tried, 
two techniques have performed especially well. 

3.1 Method average-count 

This scheme is an instance of Jelinek-Mercer
smoothing. Referring to equation (3), recall that
Bahl et al. suggest bucketing the $\lambda_{w_{i-n+1}^{i-1}}$ according
to $c(w_{i-n+1}^{i-1})$. We have found that partitioning the
$\lambda_{w_{i-n+1}^{i-1}}$ according to the average number of counts
per non-zero element

$$\frac{c(w_{i-n+1}^{i-1})}{|\{w_i : c(w_{i-n+1}^{i}) > 0\}|}$$

yields better results.

Intuitively, the less sparse the data for estimating
$P_{ML}(w_i \mid w_{i-n+1}^{i-1})$, the larger $\lambda_{w_{i-n+1}^{i-1}}$ should be.
While larger $c(w_{i-n+1}^{i-1})$ generally correspond to less
sparse distributions, this quantity ignores the
allocation of counts between words. For example, we
would consider a distribution with ten counts
distributed evenly among ten words to be much more
sparse than a distribution with ten counts all on a
single word. The average number of counts per word
seems to more directly express the concept of
sparseness.

2 When the same data is used to estimate both, setting
all $\lambda_{w_{i-n+1}^{i-1}}$ to one yields the optimal result.

In Figure 1, we graph the value of $\lambda$ assigned to
each bucket under the original and new bucketing
schemes on identical data. Notice that the new
bucketing scheme results in a much tighter plot,
indicating that it is better at grouping together
distributions with similar behavior.
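The difference between the two bucketing statistics can be checked directly on the ten-count example. The short sketch below uses hypothetical word labels and helper names; it computes the total count of a history and the average count per word with non-zero count for the two distributions.

```python
def total_count(dist):
    """Total count of the history: the statistic used in the original bucketing."""
    return sum(dist.values())


def average_count(dist):
    """Average number of counts per word with non-zero count."""
    nonzero = [c for c in dist.values() if c > 0]
    return sum(nonzero) / len(nonzero)


# Ten counts spread evenly over ten words versus ten counts on a single word.
even = {f"w{i}": 1 for i in range(10)}
peaked = {"w0": 10}

print(total_count(even), total_count(peaked))      # identical totals
print(average_count(even), average_count(peaked))  # very different averages
```

Bucketing on the total count places both histories in the same bucket, while the average count separates the sparse distribution from the concentrated one.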



3.2 Method one-count 
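This technique combines two intuitions. First,
MacKay and Peto (1995) argue that a reasonable
form for a smoothed distribution is

$$P_{one}(w_i \mid w_{i-n+1}^{i-1}) = \frac{c(w_{i-n+1}^{i}) + \alpha P_{one}(w_i \mid w_{i-n+2}^{i-1})}{c(w_{i-n+1}^{i-1}) + \alpha}$$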
The parameter $\alpha$ can be thought of as the number
of counts being added to the given distribution,
where the new counts are distributed as in the
lower-order distribution. Secondly, the Good-Turing
estimate can be interpreted as stating that the number
of these extra counts should be proportional to the
number of words with exactly one count in the given
distribution. We have found that taking

$$\alpha = \gamma \left[ n_1(w_{i-n+1}^{i-1}) + \beta \right] \qquad (4)$$

works well, where

$$n_1(w_{i-n+1}^{i-1}) = |\{w_i : c(w_{i-n+1}^{i}) = 1\}|$$

is the number of words with one count, and where $\beta$
and $\gamma$ are constants.
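A minimal bigram sketch of this scheme follows. It is an illustration under stated assumptions rather than the implementation evaluated in this work: the class name and toy data are hypothetical, the same $\beta$ and $\gamma$ are shared by the unigram and bigram levels for brevity, and the recursion ends in a uniform distribution over the vocabulary.

```python
from collections import Counter


class OneCountBigram:
    """Bigram model where alpha = gamma * (n1(history) + beta) extra counts are
    added to each distribution and spread according to the lower-order model."""

    def __init__(self, tokens, vocab, beta=1.0, gamma=1.0):
        self.vocab = set(vocab)
        self.beta, self.gamma = beta, gamma
        self.uni = Counter(tokens)
        self.total = len(tokens)
        self.bi = Counter(zip(tokens, tokens[1:]))
        self.hist = Counter(tokens[:-1])
        # n1[h] = number of distinct words seen exactly once after history h
        self.n1 = Counter(h for (h, _), c in self.bi.items() if c == 1)
        # number of words seen exactly once overall (empty history)
        self.n1_uni = sum(1 for c in self.uni.values() if c == 1)

    def p_unigram(self, word):
        alpha = self.gamma * (self.n1_uni + self.beta)
        uniform = 1.0 / len(self.vocab)
        return (self.uni[word] + alpha * uniform) / (self.total + alpha)

    def p_bigram(self, word, prev):
        alpha = self.gamma * (self.n1[prev] + self.beta)
        return (self.bi[(prev, word)] + alpha * self.p_unigram(word)) / (self.hist[prev] + alpha)


tokens = "a b a c a b b a".split()
model = OneCountBigram(tokens, vocab={"a", "b", "c", "d"})
print({w: round(model.p_bigram(w, "a"), 4) for w in sorted(model.vocab)})
```

Because the added mass is distributed according to a normalized lower-order model, each conditional distribution sums to one, and words never seen after a history (or never seen at all, such as d here) still receive non-zero probability.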

4 Experimental Methodology 
4.1 Data 

We used the Penn treebank and TIPSTER cor- 
pora distributed by the Linguistic Data Consor- 
tium. From the treebank, we extracted text from 
the tagged Brown corpus, yielding about one mil- 
lion words. From TIPSTER, we used the Associ- 
ated Press (AP), Wall Street Journal (WSJ), and 
San Jose Mercury News (SJM) data, yielding 123, 
84, and 43 million words respectively. We created 
two distinct vocabularies, one for the Brown corpus 
and one for the TIPSTER data. The former vocab- 
ulary contains all 53,850 words occurring in Brown; 
the latter vocabulary consists of the 65,173 words 
occurring at least 70 times in TIPSTER. 

For each experiment, we selected three segments 
of held-out data along with the segment of train- 
ing data. One held-out segment was used as the 



test data for performance evaluation, and the other 
two were used as development test data for opti- 
mizing the parameters of each smoothing method. 
Each piece of held-out data was chosen to be roughly 
50,000 words. This decision does not reflect practice 
very well, as when the training data size is less than 
50,000 words it is not realistic to have so much devel- 
opment test data available. However, we made this 
decision to prevent us having to optimize the train- 
ing versus held-out data tradeoff for each data size. 
In addition, the development test data is used to op- 
timize typically very few parameters, so in practice 
small held-out sets are generally adequate, and per- 
haps can be avoided altogether with techniques such 
as deleted estimation. 

4.2 Smoothing Implementations 

In this section, we discuss the details of our
implementations of various smoothing techniques. Due
to space limitations, these descriptions are not
comprehensive; a more complete discussion is presented
in Chen (1996). The titles of the following sections
include the mnemonic we use to refer to the
implementations in later sections. Unless otherwise
specified, for those smoothing models defined recursively
in terms of lower-order models, we end the recursion
by taking the n = 0 distribution to be the uniform
distribution $P_{unif}(w_i) = 1/|V|$. For each method, we
highlight the parameters (e.g., $\lambda_n$ and $\delta$ below) that
can be tuned to optimize performance. Parameter
values are determined through training on held-out
data.

4.2.1 Baseline Smoothing (interp-baseline) 

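For our baseline smoothing method, we use an
instance of Jelinek-Mercer smoothing where we
constrain all $\lambda_{w_{i-n+1}^{i-1}}$ to be equal to a single value $\lambda_n$ for
each n, i.e.,

$$P_{base}(w_i \mid w_{i-n+1}^{i-1}) = \lambda_n P_{ML}(w_i \mid w_{i-n+1}^{i-1}) + (1 - \lambda_n) P_{base}(w_i \mid w_{i-n+2}^{i-1})$$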

4.2.2 Additive Smoothing (plus-one and 
plus-delta) 

We consider two versions of additive smoothing. 
Referring to equation (2), we fix $\delta = 1$ in plus-one
smoothing. In plus-delta, we consider any $\delta$.

4.2.3 Katz Smoothing (katz) 



While the original paper (Katz, 1987) uses a single
parameter k, we instead use a different k for each
n > 1, $k_n$. We smooth the unigram distribution
using additive smoothing with parameter $\delta$.



4.2.4 Church-Gale Smoothing 

(church-gale) 

To smooth the counts $n_r$ needed for the Good-Turing
estimate, we use the technique described by
Gale and Sampson (1995). We smooth the unigram
distribution using Good-Turing without any bucketing.

Instead of the bucketing scheme described in the
original paper, we use a scheme analogous to the
one described by Bahl, Jelinek, and Mercer (1983).
We make the assumption that whether a bucket is
large enough for accurate Good-Turing estimation
depends on how many n-grams with non-zero counts
occur in it. Thus, instead of partitioning the space
of $P(w_{i-1}) P(w_i)$ values in some uniform way as was
done by Church and Gale, we partition the space
so that at least $c_{min}$ non-zero n-grams fall in each
bucket.
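One way such a partition could be built is a greedy sweep over the n-grams sorted by their bucketing value, closing a bucket once it holds at least $c_{min}$ n-grams. The sketch below is an assumption about how this might be done, not a description of the actual implementation; the function name and the scores are hypothetical.

```python
def make_buckets(scored_ngrams, c_min):
    """Greedily partition n-grams (with non-zero counts) by increasing score.

    scored_ngrams: list of (score, ngram) pairs, where score is the value used
    for bucketing, e.g. a product of lower-order probabilities.
    A bucket is closed once it holds at least c_min n-grams and the next score
    differs, so n-grams with equal scores never straddle a boundary.
    """
    items = sorted(scored_ngrams)
    buckets, current = [], []
    for idx, (score, gram) in enumerate(items):
        current.append(gram)
        is_last = idx == len(items) - 1
        next_differs = is_last or items[idx + 1][0] != score
        if len(current) >= c_min and next_differs:
            buckets.append(current)
            current = []
    if current:                       # leftover too small to stand alone
        if buckets:
            buckets[-1].extend(current)
        else:
            buckets.append(current)
    return buckets


scores = [0.01, 0.02, 0.02, 0.05, 0.07, 0.11, 0.13]
scored = [(s, ("w%d" % k,)) for k, s in enumerate(scores)]
buckets = make_buckets(scored, c_min=3)
print([len(b) for b in buckets])
```

With fewer than $c_{min}$ n-grams in total, this sketch returns a single undersized bucket rather than failing.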

Finally, the original paper describes only bigram 
smoothing in detail; extending this method to tri- 
gram smoothing is ambiguous. In particular, it is 
unclear whether to bucket trigrams according to 
$P(w_{i-2}^{i-1}) P(w_i)$ or $P(w_{i-2}^{i-1}) P(w_i \mid w_{i-1})$. We chose the
former; while the latter may yield better perfor- 
mance, our belief is that it is much more difficult 
to implement and that it requires a great deal more 
computation. 

4.2.5 Jelinek-Mercer Smoothing 
(interp-held-out and interp-del-int) 

We implemented two versions of Jelinek-Mercer 
smoothing differing only in what data is used to 
train the $\lambda$'s. We bucket the $\lambda_{w_{i-n+1}^{i-1}}$ according to
$c(w_{i-n+1}^{i-1})$ as suggested by Bahl et al. Similar to our
Church-Gale implementation, we choose buckets to
ensure that at least $c_{min}$ words in the data used to
train the $\lambda$'s fall in each bucket.

In interp-held-out, the $\lambda$'s are trained using
held-out interpolation on one of the development
test sets. In interp-del-int, the $\lambda$'s are trained
using the relaxed deleted interpolation technique de- 
scribed by Jelinek and Mercer, where one word is 
deleted at a time. In interp-del-int, we bucket 
an n-gram according to its count before deletion, as 
this turned out to significantly improve performance. 

4.2.6 Novel Smoothing Methods 
(new-avg-count and new-one-count) 

The implementation new-avg-count, correspond- 
ing to smoothing method average-count, is identical 
to interp-held-out except that we use the novel 
bucketing scheme described in section 3.1. In the
implementation new-one-count, we have different
parameters $\beta_n$ and $\gamma_n$ in equation (4) for each n.



5 Results 

In Figure 2, we display the performance of the
interp-baseline method for bigram and trigram
models on TIPSTER, Brown, and the WSJ subset
of TIPSTER. In Figures 3-6, we display the relative
performance of various smoothing techniques with
respect to the baseline method on these corpora, as
measured by difference in entropy. In the graphs
on the left of Figures 2-4, each point represents an
average over ten runs; the error bars represent the
empirical standard deviation over these runs. Due
to resource limitations, we only performed multiple
runs for data sets of 50,000 sentences or less. Each
point on the graphs on the right represents a single
run, but we consider sizes up to the amount of
data available. The graphs on the bottom of
Figures 3-4 are close-ups of the graphs above, focusing
on those algorithms that perform better than the
baseline. To give an idea of how these cross-entropy
differences translate to perplexity, each 0.014 bits
corresponds roughly to a 1% change in perplexity.
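The relation between a cross-entropy difference in bits and a relative change in perplexity follows from perplexity being two raised to the cross-entropy. The short sketch below works through the arithmetic; the sentence probabilities and word count are made-up values, and cross-entropy is computed as the negative log2 probability of the test sentences divided by the number of words.

```python
import math


def cross_entropy(sentence_probs, n_words):
    """Bits per word: negative log2 probability of the test sentences / word count."""
    return -sum(math.log2(p) for p in sentence_probs) / n_words


def perplexity(entropy_bits):
    """Perplexity corresponding to a cross-entropy given in bits per word."""
    return 2.0 ** entropy_bits


# Two hypothetical test sentences with probabilities 2^-10 and 2^-6, four words in total.
h = cross_entropy([2 ** -10, 2 ** -6], n_words=4)
ratio = perplexity(h + 0.014) / perplexity(h)
print(h, perplexity(h), ratio)
```

The ratio equals $2^{0.014} \approx 1.0098$ regardless of the starting entropy, which is the roughly 1% change in perplexity quoted above.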

In each run except as noted below, optimal val- 
ues for the parameters of the given technique were 
searched for using Powell's search algorithm as realized
in Numerical Recipes in C (Press et al., 1988,
pp. 309-317). Parameters were chosen to optimize
the cross-entropy of one of the development test sets 
associated with the given training set. To constrain 
the search, we searched only those parameters that 
were found to affect performance significantly, as 
verified through preliminary experiments over sev- 
eral data sizes. For katz and church-gale, we did 
not perform the parameter search for training sets 
over 50,000 sentences due to resource constraints, 
and instead manually extrapolated parameter val- 
ues from optimal values found on smaller data sizes. 
We ran interp-del-int only on sizes up to 50,000 
sentences due to time constraints. 

From these graphs, we see that additive smooth- 
ing performs poorly and that methods katz and 
interp-held-out consistently perform well. Our 
implementation church-gale performs poorly ex- 
cept on large bigram training sets, where it performs 
the best. The novel methods new-avg-count and 
new-one-count perform well uniformly across train- 
ing data sizes, and are superior for trigram models. 
Notice that while performance is relatively consis- 
tent across corpora, it varies widely with respect to 
training set size and n-gram order. 

The method interp-del-int performs significantly
worse than interp-held-out, though they differ only in
the data used to train the $\lambda$'s. However, we delete
one word at a time in interp-del-int; we hypothesize
that deleting larger chunks would lead to more similar
performance.

Figure 2: Baseline cross-entropy on test data; graph on left displays averages over ten runs for training sets
up to 50,000 sentences, graph on right displays single runs for training sets up to 10,000,000 sentences

Figure 3: Trigram model on TIPSTER data; relative performance of various methods with respect to baseline;
graphs on left display averages over ten runs for training sets up to 50,000 sentences, graphs on right display
single runs for training sets up to 10,000,000 sentences; top graphs show all algorithms, bottom graphs zoom
in on those methods that perform better than the baseline method

Figure 4: Bigram model on TIPSTER data; relative performance of various methods with respect to baseline;
graphs on left display averages over ten runs for training sets up to 50,000 sentences, graphs on right display
single runs for training sets up to 10,000,000 sentences; top graphs show all algorithms, bottom graphs zoom
in on those methods that perform better than the baseline method

Figure 5: Bigram and trigram models on Brown corpus; relative performance of various methods with respect
to baseline

Figure 6: Bigram and trigram models on Wall Street Journal corpus; relative performance of various methods
with respect to baseline

Figure 7: Performance of katz and new-avg-count with respect to parameters $\delta$ and $c_{min}$, respectively

In Figure 7, we show how the values of the
parameters $\delta$ and $c_{min}$ affect the performance of
methods katz and new-avg-count, respectively, over
several training data sizes. Notice that poor parameter
setting can lead to very significant losses in
performance, and that optimal parameter settings depend
on training set size.

To give an informal estimate of the difficulty of
implementation of each method, in Table 1 we
display the number of lines of C++ code in each
implementation excluding the core code common across
techniques.

Method               Lines
interp-baseline^3      400
plus-one                40
plus-delta              40
katz                   300
church-gale           1000
interp-held-out        400
interp-del-int         400
new-avg-count          400
new-one-count           50

Table 1: Implementation difficulty of various methods
in terms of lines of C++ code

3 To implement the baseline method, we just used the
interp-held-out code as it is a special case. Written
anew, it probably would have been about 50 lines.



6 Discussion 

To our knowledge, this is the first empirical compari- 
son of smoothing techniques in language modeling of 
such scope: no other study has used multiple training
data sizes or corpora, or has performed parameter
optimization. We show that in order to completely
characterize the relative performance of two tech- 
niques, it is necessary to consider multiple training 
set sizes and to try both bigram and trigram mod- 
els. Multiple runs should be performed whenever 
possible to discover whether any calculated differ- 
ences are statistically significant. Furthermore, we 
show that sub-optimal parameter selection can also 
significantly affect relative performance. 

We find that the two most widely used techniques, 
Katz smoothing and Jelinek-Mercer smoothing, per- 
form consistently well across training set sizes for 
both bigram and trigram models, with Katz smooth- 
ing performing better on trigram models produced 
from large training sets and on bigram models in 
general. These results question the generality of the 
previous reference result concerning Katz smooth- 
ing: Katz (1987) reported that his method slightly 
outperforms an unspecified version of Jelinek-Mercer 
smoothing on a single training set of 750,000 words. 
Furthermore, we show that Church-Gale smooth- 
ing, which previously had not been compared with 
common smoothing techniques, outperforms all ex- 
isting methods on bigram models produced from 
large training sets. Finally, we find that our novel 
methods average-count and one-count are superior 
to existing methods for trigram models and perform 
well on bigram models; method one-count yields 
marginally worse performance but is extremely easy 
to implement. 

In this study, we measure performance solely 
through the cross-entropy of test data; it would 
be interesting to see how these cross-entropy differ- 
ences correlate with performance in end applications 
such as speech recognition. In addition, it would be 
interesting to see whether these results extend to 
fields other than language modeling where smooth- 
ing is used, such as prepositional phrase attachment
(Collins and Brooks, 1995), part-of-speech tagging
(Church, 1988), and stochastic parsing (Magerman,
1994).